Windows Azure Queue Overview

11/25/2010 4:11:41 PM

Up to this point, the discussion has been geared toward convincing you that queues are the next best thing to sliced bread (and YouTube). Now, let’s look at what Windows Azure queues are, what they provide, and how to use them.

1. Architecture and Data Model

The Windows Azure queue storage service is composed of two main resources in its data model:

Queues: A queue contains many messages, and a Windows Azure storage account can contain any number of queues. There is no limit to the number of messages that can be stored in any individual queue. However, there is a time limit: messages can be stored for only a week before they are deleted. Windows Azure queues aren’t meant for long-lived messages, though, so any message that sticks around for such a long period of time is probably a bug.
Messages: A message is a small piece of data up to 8 KB in size. It is added to the queue using a REST API and delivered to a receiver. Note that, although you can store data in any format, the receiver will see it as base64-encoded data. Every message also has some special properties, as shown in Table 1.

Table 1. Message properties
Name	Description
MessageId	This is a GUID that uniquely identifies the message within the queue.
VisibilityTimeout	This property specifies the exact opposite of what its name suggests. This value determines how long a message stays invisible (that is, of course, not visible) after a consumer retrieves it from the queue. (You’ll learn how to use this to protect against crashing servers later in this chapter.) After this time has elapsed, if the message hasn’t been deleted, it’ll show up in the queue again for any consumer. By default, this is 30 seconds.
PopReceipt	The server gives the receiver a unique PopReceipt when a message is retrieved. This must be used in conjunction with a MessageId to permanently delete a message.
MessageTTL	This specifies the Time to Live (TTL) in seconds for a message. Note that the default and the maximum are the same: seven days. If the message hasn’t been deleted from the queue in seven days, the system will lazily garbage-collect and delete it.
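
To make the data model concrete, here is a minimal Python sketch of a message and the limits listed in Table 1. The class name QueueMessage and its fields are hypothetical and are not the service's wire format. The sketch also assumes the 8 KB limit applies to the raw body, which the text does not specify.

```python
import base64
import uuid
from dataclasses import dataclass, field

MAX_MESSAGE_BYTES = 8 * 1024          # 8 KB limit from the text
DEFAULT_VISIBILITY_TIMEOUT = 30       # seconds, per Table 1
MAX_TTL_SECONDS = 7 * 24 * 60 * 60    # seven days, default and maximum


@dataclass
class QueueMessage:
    """Hypothetical local model of a message, for illustration only."""
    body: bytes
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    ttl_seconds: int = MAX_TTL_SECONDS

    def __post_init__(self):
        # Assumption: the size limit is checked against the raw body.
        if len(self.body) > MAX_MESSAGE_BYTES:
            raise ValueError("message body exceeds 8 KB")
        if not 0 < self.ttl_seconds <= MAX_TTL_SECONDS:
            raise ValueError("TTL must be between 1 second and seven days")

    def as_received(self) -> str:
        """Return the body the way a receiver sees it: base64-encoded text."""
        return base64.b64encode(self.body).decode("ascii")
```

A receiver would reverse the encoding with base64.b64decode before interpreting the payload.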

2. The Life of a Message

The life of a message is a short (but exciting and fruitful) one. Windows Azure queues deliver at-least-once semantics. In other words, Windows Azure queues try really, really hard to ensure that someone reads and finishes processing a message at least once. However, they don’t guarantee that someone won’t see a message more than once (no “at-most-once” semantics), nor that the messages will appear in order.

Figure 1 shows the life of a message.

The typical flow is something like the following:

1. A producer adds a message to the queue, with an optional TTL.
2. The queue system waits for a consumer to take the message off the queue. Regardless of what happens, if the message is on the queue for longer than the TTL, the message gets deleted.
3. A consumer takes the message from the queue and starts processing it. When the consumer takes the message off the queue, Windows Azure queues make the message invisible. Note that the message is not deleted. Windows Azure just flips a bit somewhere to note that it shouldn’t return the message to consumers for the time being. The consumer is given a PopReceipt. This is unique for every time a consumer takes a message off the queue. If a consumer takes the message multiple times, it’ll get a different PopReceipt every time. This entire step is where things get interesting. Two scenarios can play out here:
   a. In the first scenario, the consumer finishes processing the message successfully. The consumer can then tell the queue to delete the message using the PopReceipt and MessageId. This is basically you telling the queue, “Hey, I’m done processing. Nuke this message.”
   b. In the second scenario, the consumer crashes or loses connectivity while processing the message. As noted earlier, this can happen often in distributed services. You don’t want queue messages to go unprocessed. This is where the invisibility and the VisibilityTimeout kick in. Windows Azure queues wait the number of seconds specified by VisibilityTimeout, and then say, “Hmm, this message hasn’t been deleted yet. The consumer probably crashed, so I’m going to make this message visible again.” At this point, the message reappears on the queue, ready to be processed by another consumer or the same consumer. Note that the original crashing consumer could come back online and delete the message. Windows Azure queues are smart enough to reconcile both of these events and delete the message from the queue.

Figure 1. Message life cycle

Picking the right VisibilityTimeout value depends on your application. Pick a number that is too small and the message could reappear on the queue before your consumer has had a chance to finish processing it, so a second consumer starts the same work. Pick a timeout that is too large and, after a failure, the message stays hidden for that whole period before another consumer can retry it. This is one area where you should experiment to see what number works for you.

In the real world, step 3b will see a different consumer pick up and process the message to completion, while the first crashing consumer is resurrected. Using this two-phase model to delete messages, Windows Azure queues ensure that every message gets processed at least once.
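
The life cycle described above can be simulated with a small in-memory stand-in. This is only a sketch: InMemoryQueue and its methods are hypothetical names, time is passed in explicitly as `now` so the behavior is deterministic, and the sketch honors only the most recently issued PopReceipt, which is a simplifying assumption and does not model how the real service reconciles a late delete from a recovered consumer.

```python
import uuid


class InMemoryQueue:
    """Toy stand-in for the queue service; every name here is hypothetical."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message_id -> state dict

    def put(self, body, now, ttl=7 * 24 * 3600):
        """Add a message that is visible immediately and expires after ttl."""
        message_id = str(uuid.uuid4())
        self._messages[message_id] = {
            "body": body,
            "visible_at": now,
            "expires_at": now + ttl,
            "pop_receipt": None,
        }
        return message_id

    def get(self, now):
        """Return (message_id, pop_receipt, body) for a visible message, or None."""
        self._expire(now)
        for message_id, state in self._messages.items():
            if state["visible_at"] <= now:
                # Hide the message instead of deleting it.
                state["visible_at"] = now + self.visibility_timeout
                # Issue a fresh receipt on every dequeue.
                state["pop_receipt"] = str(uuid.uuid4())
                return message_id, state["pop_receipt"], state["body"]
        return None

    def delete(self, message_id, pop_receipt):
        """Second phase: delete only with the latest receipt (sketch assumption)."""
        state = self._messages.get(message_id)
        if state is None or state["pop_receipt"] != pop_receipt:
            return False
        del self._messages[message_id]
        return True

    def _expire(self, now):
        """Drop messages whose TTL has elapsed."""
        expired = [k for k, s in self._messages.items() if s["expires_at"] <= now]
        for key in expired:
            del self._messages[key]
```

With a 30-second timeout, a message taken at time 0 and never deleted is returned again by a get at time 31, with a different PopReceipt. That second consumer can then delete it.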

Note:

One interesting issue that occurs when messages get redelivered on crashing receivers has to do with poison messages. Imagine a message that maliciously or nonmaliciously causes a bug in your code, and causes a crash. Since the message won’t be deleted, it’ll show up in the queue again, and cause another crash…and another crash…and over and over. Since it stays invisible for a short period of time, this effect can go unnoticed for a long period of time, and cause significant availability issues for your service. Protecting against poison messages is simple: get the security basics right, and ensure that your worker process is resilient to bad input.

Poison messages will eventually leave your system when their TTL is over. This could be an argument for making your TTLs shorter to reduce the impact of bad messages. Of course, you’ll have to weigh that against the risk of losing messages if your receivers don’t process messages quickly enough.
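
One way to make a worker resilient to bad input is to catch handler failures rather than let them crash the process. The Python sketch below goes one step further and counts attempts per MessageId, parking a message after a few failures. The function name, the threshold, and the in-process counter are all assumptions made for illustration. A counter held in memory is lost when the worker restarts, so a real deployment would need to keep it in shared storage.

```python
import logging

MAX_ATTEMPTS = 3  # hypothetical threshold, tune for your application


def handle_with_guard(message_id, body, handler, attempts, parked):
    """Run handler(body); return True when the caller should delete the message.

    attempts: dict mapping message_id to the number of tries so far.
    parked:   list collecting (message_id, body) pairs that kept failing.
    """
    attempts[message_id] = attempts.get(message_id, 0) + 1
    try:
        handler(body)
        return True  # success: delete the message
    except Exception:
        logging.exception("handler failed for message %s", message_id)
        if attempts[message_id] >= MAX_ATTEMPTS:
            parked.append((message_id, body))  # keep it for later inspection
            return True                        # delete so it stops cycling
        return False                           # let it reappear and be retried
```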

3. Queue Usage Considerations

Windows Azure queues trip up people because they expect the service to be just like MSMQ, SQL Service Broker, or <insert-any-common-messaging-system>—and it isn’t. You should be aware of some common “gotchas” when using Windows Azure queues. Note that these really aren’t defects in the system—they’re part of the package when dealing with highly scalable and reliable distributed services. Some things just work differently in the cloud.

3.1. Messages can be repeated (idempotency)

It is important that your code be idempotent when it comes to processing queue messages. In other words, your code should be able to receive the same message multiple times, and the result shouldn’t be any different. There are several ways to accomplish this.

One way is to just do the work over and over again—transcoding the same video a few times doesn’t really matter in the big picture. In other cases, you may not want to process the same transaction repeatedly (for example, anything to do with financial transactions). Here, the right thing to do is to keep some state somewhere to indicate that the operation has been completed, and to check that state before performing that operation again. For example, if you’re processing a payment, check whether that specific credit card transaction has already happened.
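
The check-before-acting approach for payments might look like the following Python sketch. The names process_payment, completed, and charge are hypothetical. The set stands in for durable state, and a real system would need the check and the update to be atomic (for example, a conditional write in a database) so that two consumers cannot both pass the check.

```python
def process_payment(transaction_id, amount, completed, charge):
    """Charge once per transaction_id, however many times the message arrives.

    completed: set of transaction IDs already processed (stand-in for a store).
    charge:    callable that performs the actual charge.
    Returns True if the charge ran, False if this was a redelivery.
    """
    if transaction_id in completed:
        return False  # already done: treat the repeat as a no-op
    charge(amount)
    completed.add(transaction_id)
    return True
```

Calling process_payment twice with the same transaction ID runs the charge only the first time.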

3.2. Messages can show up out of order

This possibility trips up people since they expect a system called a “queue” to always show first-in, first-out (FIFO) characteristics. However, this isn’t easily possible in a large distributed system, so messages can show up out of order once in a while. One good way to ensure that you process messages in order is to attach an increasing ID to every message, and reject messages that skip ahead.
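
A minimal Python sketch of the increasing-ID approach follows. InOrderConsumer is a hypothetical name, and the sketch assumes the producer numbers messages consecutively starting from a known value. A rejected message is simply not deleted, so it becomes visible again later and can be retried once the missing message has been processed.

```python
class InOrderConsumer:
    """Process messages strictly in sequence order (illustrative sketch)."""

    def __init__(self, handler, first_sequence=0):
        self.handler = handler
        self.expected = first_sequence  # next sequence number we will accept

    def offer(self, sequence, body):
        """Return 'processed', 'duplicate', or 'rejected' for this message."""
        if sequence < self.expected:
            return "duplicate"  # already handled: safe to delete
        if sequence > self.expected:
            return "rejected"   # skips ahead: leave it on the queue for later
        self.handler(body)
        self.expected += 1
        return "processed"
```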

3.3. Time skew/late delivery

Time skew and late delivery are two different issues, but they are related because they have to do with timing. When using Windows Azure worker roles to process queue messages, you shouldn’t rely on the clocks being in sync. In the cloud, clocks can drift by up to a minute, and any code that relies on specific timestamps to process messages must take this into account.
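
One way to take drift into account is to build a tolerance into every timestamp comparison. The Python sketch below treats a deadline as passed only when it is older than the local clock by more than the allowed skew. The function name is hypothetical, and the one-minute tolerance simply mirrors the figure mentioned above.

```python
from datetime import datetime, timedelta, timezone

MAX_CLOCK_SKEW = timedelta(minutes=1)  # tolerance taken from the text


def is_definitely_past(deadline, local_now, skew=MAX_CLOCK_SKEW):
    """Return True only if deadline has passed even allowing for clock drift."""
    return local_now - skew > deadline
```

A deadline that appears to be 30 seconds in the past is not treated as passed, because the machine that stamped it may have a clock that runs behind this one.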

Another issue is late delivery. When you place a message onto the queue, it may not show up for the receiver for some time. Your application shouldn’t depend on the receiver instantly getting to view the message.